Speech recognition involving a mobile device

ABSTRACT

A system and method of speech recognition involving a mobile device. Speech input is received ( 202 ) on a mobile device ( 102 ) and converted ( 204 ) to a set of phonetic symbols. Data relating to the phonetic symbols is transferred ( 206 ) from the mobile device over a communications network ( 104 ) to a remote processing device ( 106 ) where it is used ( 208 ) to identify at least one matching data item from a set of data items ( 114 ). Data relating to the at least one matching data item is transferred ( 210 ) from the remote processing device to the mobile device and presented ( 214 ) thereon.

The present invention relates to speech recognition involving a mobile device.

Speech recognition involving mobile devices, such as mobile telephones, is conventionally carried out either entirely locally on the processor fitted in the device/phone itself or by sending the speech waveform over the network to a server. Servers have to be used when the processor (if any) in the phone has inadequate power or inadequate memory resources to carry out the speech recognition satisfactorily. Servers also have to be used when the information needed for the speech recognition must be held centrally because it is too sensitive to distribute to the phone subscribers (e.g. a list of all account holders at a bank), is too large or needs to be updated very frequently. Server-based arrangements incur costs, normally borne by the subscriber or by the network service provider, for the transmission of the encoded speech waveforms over the network. Also, especially when the network connection is poor, undesirable delays (latencies) are introduced into the speech recognition response. Transmission of the speech waveform over the network also normally requires the signal to be restricted to the telephone bandwidth (roughly 300 Hz to 3.3 kHz) or at least the region below 4 kHz, and the necessary encoding of the waveform inevitably introduces distortions.

Some work has been carried out on “distributed speech recognition”, in which the power spectrum parameters, typically mel-frequency cepstrum components (MFCCs), are computed locally and transmitted to the server where recognition is carried out. The sets of MFCCs, which are usually called “frames”, are computed at a rate of around 100 per second. The main motivation for this arrangement compared with the conventional server-based arrangement is the avoidance of coding distortions in transmitting the speech waveform, as well as the possibility of analyzing a wider bandwidth than that normally used for telephone communications. There has been an attempt to define an international standard for distributed speech recognition in the Aurora project of the ETSI standards organisation (see http://portal.etsi.org/stg/kta/dsr/dsr.asp, for instance). However, apart from some use in Japan, the technique has not been widely used in practical situations.

Embodiments of the present invention provide an alternative approach to distributed speech recognition and have advantages over the conventional methods described above.

Embodiments of the present inventors' approach are novel and unusual in that the speech to be recognized is first converted to a (necessarily errorful) sequence of phonetic symbols. This sequence is significantly more compact than the encoded speech waveform or the MFCCs. Thus, transmitting the phonetic sequence from a mobile device over the network to a server is much faster and less expensive. The phonetic sequence can then be used on the server to search through very large lists of items and one or more good matches can be returned to the mobile device. One example of a complete speech recognition process that uses an initial conversion to a sequence or lattice of phonetic symbols is described in U.S. Pat. No. 7,146,319 (“Phonetically Based Speech Recognition System and Method”), the contents of which are incorporated herein by reference, although the techniques described in this specification may be implemented using other speech recognition processes. In some embodiments, a detailed representation of the speech is retained on the mobile device in order to allow further refinement to be carried out locally to make a more accurate decision on the best matching items.

Embodiments of the system/process have the advantage, shared by other methods involving servers, that the full database being searched and the pronouncing dictionary never have to be downloaded to the mobile device/phone but, as noted, they avoid heavy data transmission with its consequential undesirable latency. There is a particular advantage with small form factor devices and smart phones, such as the Apple iPhone™, which have a reasonably powerful processor but make it difficult to incorporate large databases into local speech recognition applications (e.g. the set of all streets in the US that is needed for a spoken address to be recognized by an in-car navigation aid), and thus make it difficult to carry out the complete recognition process locally.

According to a first aspect of the present invention there is provided a method of speech recognition involving a mobile device, the method including:

receiving speech input on a mobile device;

converting the speech input to a set of phonetic symbols on the mobile device;

transferring data relating to the phonetic symbols from the mobile device over a communications network to a remote processing device;

using the data relating to the phonetic symbols on the remote processing device to identify at least one matching data item from a set of data items;

transferring data relating to the at least one matching data item from the remote processing device to the mobile device, and

presenting data relating to the at least one transferred matching data item using the mobile device.

The identifying of at least one matching data item may include matching the data relating to the phonetic symbols against a set of phonetic reference forms corresponding to the set of data items.

The set of phonetic symbols may comprise at least one sequence (or lattice) of phonetic symbols. The lattice can comprise a directed acyclic graph used as a compact representation of a set of similar sequences.

A number of the matching data item(s) transferred from the remote processing device to the mobile device can correspond to a lower of: a maximum number of data items to be presented on the mobile device, or a number of data items arranged in order of decreasing posterior probability of corresponding to the phonetic symbols down to a predetermined threshold value. The posterior probability for a said data item can be computed by taking a match score of the phonetic symbols against the phonetic reference form of the data item and normalizing the match score.

The step of presenting data relating to the at least one transferred matching data item can comprise displaying an orthographic representation of the at least one transferred matching data item. Alternatively, the data presented may correspond to a map coordinate of a particular location represented by the matching data item. Alternatively, the data presented may correspond to an identification code for a piece of music or media represented by the transferred matching data item. The step of presenting data relating to the at least one transferred matching data item can comprise using a speech synthesis technique executing on the mobile device to generate a spoken form of the at least one transferred matching data item.

The method may further include the mobile device storing data representing the speech input (e.g. as a speech waveform or a sequence of frames of MFCCs or other representation of a short-term acoustic power spectrum corresponding to the speech input) and the mobile device performing a rescoring process using the at least one transferred matching data item and the stored speech input data. The rescoring process can include (for each said matching data item to be rescored) generating sequences or lattices of acoustic hidden Markov models (HMMs) of phonetic units corresponding to a phonetic specification of reference pronunciations corresponding to a said transferred matching data item. Each of the sequences or lattices can be matched against a sequence of frames of spectrum parameters corresponding to the speech input data stored in the mobile device to produce a match score. The matching of the sequences or lattices may involve Viterbi time alignment or a full forward probability method. The rescoring process may involve producing a network including data representing phonetic specifications of all of the data items in the set.

The method may further include the mobile device sharing common elements in the phonetic specification in order to reduce an amount of data transferred to the mobile device for the rescoring process. The method may further include transferring data specifying a general-purpose sub-grammar representing a set of alternative number-word sequences to the mobile device (rather than transferring data fully describing the alternative number-word sequences) for the rescoring process, and the mobile device may use the sub-grammar to determine a most likely one of the number-word sequences to be presented.

The rescoring process may include:

transferring phonetic specifications corresponding to the matching data items from the remote processing device to the mobile device with corresponding indices instead of a full description of the matching data items;

the mobile device selecting which of the phonetic specification(s) is/are to be presented;

transferring the index or indices corresponding to the selected phonetic specification(s) from the mobile device to the remote processing device;

the remote processing device transferring data representing a full description(s) of the data item(s) corresponding to the index or indices transferred by the mobile device back to the mobile device, and

the mobile device presenting the data representing the full description(s).

The method may include transferring from the remote processing device the matching data items corresponding to a predetermined number of best matches of the phonetic symbols and the data items in the set.

According to a further aspect of the present invention there is provided a computer program product configured to execute methods substantially as described herein.

According to another aspect of the present invention there is provided a system adapted to perform speech recognition involving a mobile device, the system including:

a mobile device configured to receive speech input;

the mobile device including a device configured to convert the speech input to a set of phonetic symbols on the mobile device;

the mobile device further configured to transfer data relating to the phonetic symbols to a remote processing device over a communications network;

the remote processing device including a device configured to use the data relating to the phonetic symbols to identify at least one matching data item from a set of data items;

the remote device including a communications device configured to transfer data relating to the at least one matching data item to the mobile device over the communications network, and

the mobile device including a device configured to present data relating to the at least one transferred matching data item.

The mobile device may comprise a mobile telephone. The communications network may include a GSM, CDMA or UMTS network.

According to another aspect of the present invention there is provided a mobile device adapted to perform speech recognition including:

a device configured to receive speech input;

a device configured to convert the speech input to a set of phonetic symbols on the mobile device;

a device configured to transfer data relating to the phonetic symbols to a processing device over a communications network.

According to another aspect of the present invention there is provided a processing device adapted to perform speech recognition involving a mobile device, the processing device including:

a device configured to receive data relating to a set of phonetic symbols from a mobile device over a communications network;

a device configured to use the data relating to the phonetic symbols to identify at least one matching data item from a set of data items;

a communications device configured to transfer data relating to the at least one matching data item to the mobile device over the communications network.

According to yet another aspect of the present invention there is provided a mobile device configured to execute a method substantially as described herein.

According to a further aspect of the present invention there is provided a processing device adapted to perform speech recognition involving a mobile device, the processing device configured to execute a method substantially as described herein.

According to yet another aspect of the present invention there is provided a speech recognition system in which the speech to be recognized is spoken by the user into a mobile device/phone where it is reduced to a sequence or lattice of phonetic symbols, which are then transmitted over a data communications network to a server which matches the phonetic information against a set of phonetic reference forms corresponding to a set of items to be recognized and identifying information suitable for presenting to the user for one or more of the best matching items is transmitted back to the mobile phone and presented to the user.

Information on a multiplicity of best matching items can be returned to the mobile device and may be subjected to a refinement process using a stored representation of the original input speech. The information on the multiplicity of best matching items can include phonetic specifications of the one or more expected pronunciations of each of the items. The refinement may comprise matching a sequence of acoustic parameter frames derived from the input speech against a sequence or lattice of acoustic models representing phonetic units corresponding to the possible pronunciations of the items being matched. The acoustic models may comprise hidden Markov models.

The identifying information may comprise a textual form of the spoken input, or some close equivalent. The identifying information may comprise data needed by a speech synthesis system located in the mobile telephone to generate speech allowing the user to know which item has, or which multiplicity of items have, been recognized. Identifying indices corresponding to the item or plurality of items found to match best by the refinement process can be transmitted back to the server, which then returns data suitable to inform the user as to the identity of the items.

While the invention has been described above, it extends to any inventive combination of features set out above or in the following description. Although illustrative embodiments of the invention are described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to these precise embodiments. As such, many modifications and variations will be apparent to practitioners skilled in the art. Furthermore, it is contemplated that a particular feature described either individually or as part of an embodiment can be combined with other individually described features, or parts of other embodiments, even if the other features and embodiments make no mention of the particular feature. Thus, the invention extends to such specific combinations not already described.

The invention may be performed in various ways, and, by way of example only, embodiments thereof will now be described, reference being made to the accompanying drawings in which:

FIG. 1 illustrates schematically an example system where a mobile device communicates with a server;

FIG. 2 illustrates schematically example steps performed by the mobile device and the server, including a re-scoring process, and

FIG. 3 illustrates schematically example steps performed during the re-scoring process.

FIG. 1 shows an example system where a mobile device 102 can communicate via a network 104 with a server 106 in order to perform a speech recognition process. The mobile device 102 may be a mobile telephone, such as an Apple iPhone™, or another type of device, such as a Personal Digital Assistant or a portable computer with audio input/output (e.g. microphone 102A and speaker 102B) and wireless communications capabilities. The mobile device can further include a processor 103A and memory 103B for executing applications, which can involve speech recognition, as well as a display 103C.

The network 104 may comprise a Global System for Mobile (GSM) communications network, but use of other types of networks is possible. The server 106 can comprise a general purpose computer configured to communicate with other devices over the network using an interface 107 and further comprises a processor 108 and memory 110. The memory 110 of the server 106 includes a speech recognition application 112, which may be based on a speech recognition process such as that described in U.S. Pat. No. 7,146,319, for example. The memory further includes a database 114, which can be used/searched by the speech recognition application to find the data item(s) that best match the user input. The database can comprise a set of data items representing any type of information, such as a telephone number directory, music track information, etc. In an alternative embodiment the database 114 can be stored on another remote device that is in communication with the server 106.

Referring to FIG. 2, steps occurring during example use of the components of FIG. 1 are shown. At step 202 speech input from a user is received by the mobile device 102. This can, for instance, involve a telephone number directory application executing on the processor 103A of the mobile device prompting the user to say the name and/or other detail, e.g. partial address, of a person/entity whose telephone number is desired into the microphone 102A. It can also involve known techniques such as analogue to digital conversion or spectrum analysis, which will be familiar to the skilled person.

At step 204 the speech input is converted into data representing a set of phonetic symbols. The set of phonetic symbols can comprise one or more sequences of phonetic symbols that are considered to comprise close contenders for matching the speech input. Here, the term “set” is to be interpreted broadly and is not limited to data having any particular constraints in terms of order, uniqueness, etc. Because they all describe the same speech input, the sequences can be efficiently represented as a directed acyclic graph, or lattice. In some cases, a single sequence is sufficient. Here, a “phonetic symbol” can correspond to what phoneticians call a “segment”, which itself normally corresponds to a single phoneme, though the use is not excluded of phonetic units somewhat smaller than a phoneme, such as the closure, release and aspiration components of a voiceless plosive, or units somewhat larger than a phoneme, such as /sk/. The phonetic symbols can be considered to comprise a computer-generated approximate phonetic transcription of the speech input. The conversion of audio data to phonetic symbols can be performed by a phonetic decoder algorithm that is called by, or is part of, the telephone directory application executing on the mobile device. The publicly available toolkit, HTK, can, for example, be used to construct a suitable phonetic decoder, and to build the models that it needs from a corpus of training speech. The decoder may be based on Bellman's Dynamic Programming. The HTK toolkit can currently be obtained from http://htk.eng.cam.ac.uk, which also provides access to “The HTK Book”, by S. J. Young et al.
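As a rough illustration of how a lattice compactly encodes several similar sequences, the following Python sketch stores a lattice as a directed acyclic graph and enumerates the sequences it represents. The `Lattice` type, the node numbering and the example symbols are hypothetical and are not taken from this specification or from HTK.

```python
from typing import Dict, List, Tuple

# A lattice as a DAG: node -> list of (phonetic symbol, next node).
# Node 0 is the start; a node with no outgoing edges is the end.
Lattice = Dict[int, List[Tuple[str, int]]]


def expand(lattice: Lattice, node: int = 0) -> List[List[str]]:
    """Enumerate every symbol sequence from `node` to the end node."""
    edges = lattice.get(node, [])
    if not edges:  # end node: exactly one (empty) continuation
        return [[]]
    sequences = []
    for symbol, next_node in edges:
        for tail in expand(lattice, next_node):
            sequences.append([symbol] + tail)
    return sequences


def edge_count(lattice: Lattice) -> int:
    """Number of symbols that actually have to be stored or transmitted."""
    return sum(len(edges) for edges in lattice.values())


# Hypothetical lattice for an utterance that might be "pat", "bat", "pad" or "bad".
example_lattice: Lattice = {
    0: [("p", 1), ("b", 1)],
    1: [("ae", 2)],
    2: [("t", 3), ("d", 3)],
    3: [],
}
```

In this toy case five edges stand for four three-symbol sequences, which would need twelve symbols if listed separately.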

At step 206 data relating to the phonetic symbols is transmitted from the mobile device 102 via the network 104 to the server 106. The data transmitted may comprise a direct representation of the sequence(s), or can comprise a version of the sequence(s) that has been compressed, encrypted and/or coded/processed/re-ordered in some other way. Transmitting the phonetic symbol data over the network is much faster and less expensive than transmitting an encoded speech waveform or MFCCs, as with conventional server-based approaches. Upon receipt, the server may process the transmitted data to perform any necessary decompression, decryption, etc., operations.
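The size difference can be made concrete with a back-of-envelope calculation. The Python sketch below compares the three representations for an utterance of a given length. Only the rate of about 100 frames per second comes from this description; the 8 kHz, one-byte-per-sample waveform coding, the 13 four-byte coefficients per frame and the 12 one-byte phonetic symbols per second are illustrative assumptions, not measurements.

```python
def waveform_bytes(seconds: float, sample_rate_hz: int = 8000,
                   bytes_per_sample: int = 1) -> int:
    """Encoded speech waveform (assumed 8 kHz sampling, 1 byte per sample)."""
    return int(seconds * sample_rate_hz * bytes_per_sample)


def mfcc_bytes(seconds: float, frames_per_second: int = 100,
               coeffs_per_frame: int = 13, bytes_per_coeff: int = 4) -> int:
    """MFCC frames (100 frames/s as in the text; 13 x 4 bytes is an assumption)."""
    return int(seconds * frames_per_second * coeffs_per_frame * bytes_per_coeff)


def phonetic_bytes(seconds: float, symbols_per_second: int = 12,
                   bytes_per_symbol: int = 1) -> int:
    """Single phonetic sequence (assumed 12 symbols/s, 1 byte per symbol)."""
    return int(seconds * symbols_per_second * bytes_per_symbol)
```

Under these assumptions a two-second utterance needs about 16,000 bytes as a waveform, 10,400 bytes as MFCC frames and 24 bytes as a single phonetic sequence, before any compression. A lattice would be larger than a single sequence but would remain far smaller than either alternative.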

At step 208 the phonetic symbol data is used by the server speech recognition application 112 to search at least one very large list of data items in the database 114 using known techniques, including cheap symbol-matching operations, such as described in U.S. Pat. No. 7,146,319 and U.S. Pat. No. 7,403,941. Thus, the server process can determine a list of one or more items that best match the input spoken by the user of the mobile device 102. The length of the list can be limited to what can be presented to the user on the mobile device. Data relating to one or more good matches in the list found by the search can then be transmitted over the network 104 to the mobile device at step 210.
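For illustration only, the server-side search can be sketched in Python with a plain Levenshtein (edit) distance between the received phonetic sequence and each phonetic reference form. This is a simple stand-in chosen for brevity and is not the matching technique of the cited patents; the function names, the `REFERENCES` table and its phonetic spellings are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phonetic symbol sequences."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_matches(query: Sequence[str], references: Dict[str, List[str]],
                 limit: int) -> List[Tuple[str, int]]:
    """Return up to `limit` items with the smallest distance to the query."""
    scored = [(item, edit_distance(query, ref)) for item, ref in references.items()]
    scored.sort(key=lambda pair: (pair[1], pair[0]))  # ties broken by item name
    return scored[:limit]


# Hypothetical reference pronunciations held on the server.
REFERENCES: Dict[str, List[str]] = {
    "Smith": ["s", "m", "ih", "th"],
    "Smythe": ["s", "m", "ay", "dh"],
    "Schmidt": ["sh", "m", "ih", "t"],
    "Jones": ["jh", "ow", "n", "z"],
}
```

With the errorful query `["s", "m", "ih", "t"]`, both "Smith" and "Schmidt" are one edit away, which is the kind of residual ambiguity that the optional rescoring step is intended to resolve.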

The length of the list returned to the mobile device in some embodiments can be the shorter of: the maximum length of list that can be presented on the mobile device, and the list with items in order of decreasing posterior probability of corresponding to the input down to some threshold minimum value (for example, 5%). The posterior probability for a particular item would typically be computed by taking the match score of the phonetic representation of the input speech against the reference representation of that item and “normalizing” that score by dividing it by the sum of a large number of well matching scores, where an appropriate large number is 100.
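A minimal Python sketch of this list-length rule follows. It assumes that match scores are positive, likelihood-like values for which higher is better; the function names and the example scores are hypothetical. The pool of 100 scores and the 5% threshold are the example values given above.

```python
from typing import List, Tuple


def posteriors(scored: List[Tuple[str, float]],
               pool_size: int = 100) -> List[Tuple[str, float]]:
    """Normalize each score by the sum of the `pool_size` best scores."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    total = sum(score for _, score in ranked[:pool_size])
    if total <= 0:
        return []
    return [(item, score / total) for item, score in ranked]


def list_to_return(scored: List[Tuple[str, float]], max_display: int,
                   threshold: float = 0.05,
                   pool_size: int = 100) -> List[Tuple[str, float]]:
    """Shorter of: the displayable maximum, and the items above the threshold."""
    above = [(item, p) for item, p in posteriors(scored, pool_size) if p >= threshold]
    return above[:max_display]
```

For example, scores of 8, 4, 2, 1 and 1 normalize to 0.5, 0.25, 0.125, 0.0625 and 0.0625. With a threshold of 0.1, three items qualify, and a device able to show only two would receive the first two.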

The information to be presented to the user (at step 214) can comprise an orthographic representation of the items in the returned list, but in some cases the information may be delivered in some other form, such as map coordinates of a particular location, the identification code for a piece of music or media, or information that causes a text-to-speech synthesis system to generate spoken forms of the item(s) to be presented to the user.

In a more complex embodiment that is intended to produce more accurate results, a list of information corresponding to a set of well-matching items is returned to the mobile device, with the mobile phone processor then performing a refinement process known as “rescoring” (optional step 212 of FIG. 2). The information for each item consists of one or a plurality of sequences or lattices specifying the reference pronunciations of that item and a tag identifying the item. By retaining a detailed representation of the speech in the memory 103B of the mobile device (e.g. a speech waveform or a sequence of frames of MFCCs or other representation of the short-term acoustic power spectrum), this optional further refinement can be carried out locally to make a more accurate decision on the best matching items. The refinement process requires an increase in the amount of data transferred from the server to the mobile device, but this increase will still result in a smaller transfer of data than use of conventional distributed speech recognition methods. Furthermore, data transmission speeds are normally several times greater in the download (server to mobile device) direction than in the upload (mobile device to server) direction.

Referring to FIG. 3, for each item to be rescored, a process executing on the processor 103A of the mobile device 102 generates (step 302) sequences or lattices of acoustic hidden Markov models (HMMs) of phonetic units corresponding to the phonetic specification of the reference pronunciations received from the server. Each of these model sequences or lattices is then matched (step 304) against the sequence of frames of spectrum parameters corresponding to the input stored in mobile device memory 103B to produce a more reliable match score. The matching process can be identical to that used in standard HMM-based speech recognition and may involve so-called Viterbi time alignment (finding the single best alignment and summing the match log probability along it), or may comprise the so-called full forward probability method, which determines the probability of the particular sequence of HMMs having given rise to the input taking all possible time alignments into account. These steps are carried out for all the items returned to the mobile device from the server 106 and the best matching item, or the best N matching items (where N is a small number greater than 1 and less than some maximum, for which a reasonable value might be 5) can be presented to the user (e.g. step 306).
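The difference between the two matching methods can be sketched in Python on a toy model. The code assumes that the per-frame log emission probabilities have already been computed by the acoustic models, that the model starts in its first state and must end in its last state, and that all names and numbers are hypothetical. It is not a full HMM recognizer.

```python
import math
from typing import Callable, List

NEG_INF = float("-inf")


def log(p: float) -> float:
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else NEG_INF


def log_add(a: float, b: float) -> float:
    """log(exp(a) + exp(b)), computed stably."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def score(log_trans: List[List[float]], log_emit: List[List[float]],
          combine: Callable[[float, float], float]) -> float:
    """log_trans[i][j]: log P(state j | state i).
    log_emit[t][j]: log P(frame t | state j)."""
    n = len(log_trans)
    alpha = [NEG_INF] * n
    alpha[0] = log_emit[0][0]
    for t in range(1, len(log_emit)):
        new_alpha = [NEG_INF] * n
        for j in range(n):
            acc = NEG_INF
            for i in range(n):
                acc = combine(acc, alpha[i] + log_trans[i][j])
            new_alpha[j] = acc + log_emit[t][j]
        alpha = new_alpha
    return alpha[-1]


def viterbi_score(log_trans, log_emit) -> float:
    """Log probability along the single best time alignment."""
    return score(log_trans, log_emit, max)


def forward_score(log_trans, log_emit) -> float:
    """Log probability summed over all time alignments."""
    return score(log_trans, log_emit, log_add)


# Toy two-state left-to-right model and three frames of emission probabilities.
TRANS = [[log(0.5), log(0.5)], [log(0.0), log(1.0)]]
EMIT = [[log(0.9), log(0.1)], [log(0.5), log(0.5)], [log(0.1), log(0.9)]]
```

In this toy case there are two valid alignments, with probabilities 0.2025 and 0.10125, so the Viterbi score corresponds to 0.2025 and the full forward score to their sum, 0.30375.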

An efficient way of carrying out rescoring is to compile the phonetic specifications of all items in the set to be rescored into a network. In this case, it requires less computation to obtain the top-scoring item than to obtain a set of N top-scoring items. The list to be presented to the user can then comprise the top rescored item together with the N-1 remaining top-scoring items from the phonetic matching process. There is an obvious generalization, in which the top M rescored items (M>1 and <=N) are displayed followed by the remaining N-M top-scoring items from the phonetic matching process.
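The list construction just described can be sketched in a few lines of Python. The function name and the item labels are hypothetical; the sketch assumes both lists are already ordered best first.

```python
from typing import List


def presentation_list(rescored_top: List[str], phonetic_ranked: List[str],
                      n: int) -> List[str]:
    """Top M rescored items first, then the best remaining items from the
    phonetic matching order, up to N items in total."""
    result = list(rescored_top[:n])
    for item in phonetic_ranked:
        if len(result) >= n:
            break
        if item not in result:  # skip items already placed by rescoring
            result.append(item)
    return result
```

With a phonetic ranking of A, B, C, D, E and a single rescored winner C, a list of three would read C, A, B.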

In some embodiments, the total amount of data exchanged between the server 106 and the mobile device 102 is reduced or minimized. The system recognizes that the transmission of phonetic symbols massively reduces the amount of data transmitted from the mobile phone to the server. However, in the case of rescoring, the amount of data needing to be transmitted from the server to the mobile phone can be large. Such may be the case when many items are sent back for rescoring, and especially if each one is accompanied by the associated information to be presented to the user.

One technique to reduce the amount of data to be transmitted to the mobile phone is to exploit overlap between the various items. Since the items being sent for rescoring all match the spoken input reasonably well, it is usually the case that many of them resemble each other quite closely. For example, in the case of addresses, typically many items to be rescored share the same city and state. By sharing common elements in the phonetic specification, the amount of data to be transmitted can be significantly compressed. This compression technique can also be applied to the information to be displayed.
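One simple way to share common elements, shown here as a hedged Python sketch rather than as the method of any particular embodiment, is to send each distinct component once in a table and to describe every item as a list of table positions. The function names, the component granularity (street, city, state) and the phonetic spellings are hypothetical.

```python
from typing import Dict, List, Tuple


def share_components(items: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Replace repeated components with positions in a shared table."""
    table: List[str] = []
    position: Dict[str, int] = {}
    encoded: List[List[int]] = []
    for item in items:
        row = []
        for component in item:
            if component not in position:
                position[component] = len(table)
                table.append(component)
            row.append(position[component])
        encoded.append(row)
    return table, encoded


def restore(table: List[str], encoded: List[List[int]]) -> List[List[str]]:
    """Rebuild the full phonetic specifications on the mobile device."""
    return [[table[i] for i in row] for row in encoded]


# Three hypothetical address candidates sharing the same city and state.
CITY = "b ao s t ah n"
STATE = "m ae s ah ch uw s ah t s"
CANDIDATES = [
    ["m ey n s t r iy t", CITY, STATE],
    ["m ey p ah l s t r iy t", CITY, STATE],
    ["m ey s ah n s t r iy t", CITY, STATE],
]
```

Here nine component strings reduce to a table of five plus three short index lists, and `restore` recovers the original specifications exactly.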

Further, in some applications there can be positions in the phrases where it may be advantageous to use general-purpose sub-grammars. An example is house numbers in street addresses. There is likely to be considerable ambiguity remaining after symbolic scoring, a problem that is aggravated by the fact that there are many ways to say some numbers (e.g. 2021 as “two oh two one”, or “twenty twenty-one” or “two thousand [and] twenty one” etc.). Rather than sending a possibly large set of alternative number-word sequences to the mobile device 102, it may be more efficient for the server 106 to transmit the name of an appropriate sub-grammar, and leave it to the rescoring process to determine the most likely sequence of number words, which can then be transformed to a suitable representation such as the corresponding digit sequence.
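As a sketch of the idea, the Python code below keeps a small house-number sub-grammar on the mobile device, so that the server need only send its name. The name `HOUSE_NUMBER`, the registry `SUB_GRAMMARS` and the two reading styles covered (digit by digit, and two pairs for four-digit numbers) are assumptions made for illustration; a practical sub-grammar would cover more forms, such as "two thousand and twenty one".

```python
from typing import Callable, Dict, List

DIGITS = ["oh", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def two_digit(pair: str) -> str:
    """Spoken form of a two-digit string such as "21" or "05"."""
    value = int(pair)
    if value < 10:
        return "oh " + DIGITS[value]
    if value < 20:
        return TEENS[value - 10]
    tens, ones = divmod(value, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{DIGITS[ones]}"


def house_number_readings(number: str) -> List[str]:
    """Alternative number-word sequences for one digit string."""
    readings = [" ".join(DIGITS[int(d)] for d in number)]
    if len(number) == 4:
        readings.append(f"{two_digit(number[:2])} {two_digit(number[2:])}")
    return readings


# Sub-grammars stored on the device and referred to by name (hypothetical).
SUB_GRAMMARS: Dict[str, Callable[[str], List[str]]] = {
    "HOUSE_NUMBER": house_number_readings,
}


def expand(sub_grammar_name: str, candidates: List[str]) -> Dict[str, str]:
    """Map every generated word sequence back to its digit sequence."""
    generate = SUB_GRAMMARS[sub_grammar_name]
    return {reading: number for number in candidates for reading in generate(number)}
```

Once rescoring has picked the most likely word sequence, the mapping returned by `expand` gives the digit sequence to present, for example "2021" for "twenty twenty-one".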

Another technique for reducing the total amount of data to be transmitted to the mobile device 102 reduces the amount of presentation data needing to be transmitted by adding a further exchange of transmissions. Instead of a full description of the item(s) to be potentially presented to the mobile device user being transmitted at each and every exchange between the mobile device and the server, the phonetic specifications are transmitted with simple indices replacing a full description of the item(s) to be presented to the user. When the set of items to be displayed has been finally determined, their indices are transmitted back to the server 106, which then sends a full description of the presentation information for only those items to the mobile device.
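The extra exchange can be sketched in Python as follows. The class `DirectoryServer`, its methods, the database contents and the toy rescoring scores are all hypothetical and simply stand in for the server 106, the database 114 and the rescoring process.

```python
from typing import Callable, Dict, List, Tuple


class DirectoryServer:
    """Holds, per index, a phonetic specification and a full description."""

    def __init__(self, database: Dict[int, Tuple[str, str]]) -> None:
        self._database = database

    def candidates(self, indices: List[int]) -> List[Tuple[int, str]]:
        """First response: phonetic specifications tagged with indices only."""
        return [(i, self._database[i][0]) for i in indices]

    def describe(self, indices: List[int]) -> Dict[int, str]:
        """Second response: full descriptions for the requested indices."""
        return {i: self._database[i][1] for i in indices}


def select_on_device(candidates: List[Tuple[int, str]],
                     rescore: Callable[[str], float], keep: int) -> List[int]:
    """Rescore each phonetic specification and keep the best indices."""
    ranked = sorted(candidates, key=lambda c: rescore(c[1]), reverse=True)
    return [index for index, _ in ranked[:keep]]


# Hypothetical data and toy rescoring scores (higher is better).
server = DirectoryServer({
    7: ("jh aa n s m ih th", "John Smith, 555-0100"),
    9: ("jh ow n s m ay dh", "Joan Smythe, 555-0101"),
})
toy_scores = {"jh aa n s m ih th": -10.0, "jh ow n s m ay dh": -25.0}

chosen = select_on_device(server.candidates([7, 9]), toy_scores.__getitem__, keep=1)
full = server.describe(chosen)
```

Only the description for index 7 crosses the network in this example; the description for the rejected candidate is never sent.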

In practice, the top-scoring item or items determined by rescoring are usually among the top few items determined by the phonetic symbol matching process. A further alternative is therefore to transmit in the first transmission from the server the presentation information for just the top few items. The mobile device will then only request additional presentation information if the rescoring determines that items not in the top few should be presented.
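A short Python sketch of this alternative follows. The function names and the tuple layout are hypothetical; the sketch assumes the server's candidates are ordered by the phonetic match, best first.

```python
from typing import Dict, List, Tuple


def first_response(ranked: List[Tuple[int, str, str]], top_few: int
                   ) -> Tuple[List[Tuple[int, str]], Dict[int, str]]:
    """ranked: (index, phonetic specification, description), best first.
    Sends every specification, but descriptions only for the top few."""
    candidates = [(index, spec) for index, spec, _ in ranked]
    prefetched = {index: desc for index, _, desc in ranked[:top_few]}
    return candidates, prefetched


def indices_to_request(selected: List[int], prefetched: Dict[int, str]) -> List[int]:
    """Indices chosen by rescoring whose descriptions are not yet on the device."""
    return [index for index in selected if index not in prefetched]
```

If rescoring selects only items whose descriptions were already sent, `indices_to_request` returns an empty list and no second exchange is needed.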

With regard to embodiments described herein, each may be implemented using logic or instructions executed by processors or processing resources, with access to memory and data structures from which results may be obtained or generated. Servers and other devices described that carry out functions described may do so using processors and memory resources. 

What is claimed is:
 1. An electronic device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for: receiving speech input; converting the speech input to a set of phonetic symbols on the electronic device; transferring data relating to the phonetic symbols to a remote processing device over a communications network; receiving data relating to at least one matching data item from the remote processing device; and presenting data relating to the at least one received matching data item.
 2. The electronic device of claim 1, wherein presenting data relating to the at least one received matching data item comprises: displaying an orthographic representation of the at least one received matching data item.
 3. The electronic device of claim 1, wherein presenting data relating to the at least one received matching data item comprises: outputting data corresponding to a map coordinate of a location represented by a received matching data item.
 4. The electronic device of claim 1, wherein presenting data relating to the at least one received matching data item comprises: outputting data corresponding to an identification code for a media item represented by the received matching data item.
 5. The electronic device of claim 1, wherein presenting data relating to the at least one received matching data item comprises: generating a spoken form of the at least one received matching data item.
 6. The electronic device of claim 1, wherein a number of the matching data items received from the remote processing device corresponds to a lower of: a maximum number of data items to be displayed, or a number of data items arranged in order of decreasing posterior probability corresponding to the phonetic symbols down to a predetermined threshold value.
 7. The electronic device of claim 6, wherein the posterior probability is based on a match score, wherein the match score is generated by matching phonetic symbols to phonetic reference forms corresponding to the matching data items, and wherein the match score is normalized.
 8. The electronic device of claim 1, wherein the set of phonetic symbols comprises: a sequence of phonetic symbols, a lattice of phonetic symbols, or a combination thereof.
 9. The electronic device of claim 1, wherein the one or more programs further include instructions for: storing data representing the speech input; and performing a rescoring process using the data relating to the at least one received matching data item and the stored speech input data.
 10. The electronic device of claim 9, wherein the rescoring process further comprises producing a network including data representing phonetic specifications of the received data relating to at least one matching data item.
 11. The electronic device of claim 9, wherein the rescoring process further comprises: generating, for each received matching data item to be rescored, sequences or lattices of acoustic hidden Markov models (HMMs), the HMMs representing phonetic units corresponding to a phonetic specification of reference pronunciations, wherein the reference pronunciations correspond to each received matching data item to be rescored.
 12. The electronic device of claim 11, wherein the rescoring process further comprises: receiving compressed data based on shared common elements in the phonetic specification to reduce an amount of data received at the electronic device for the rescoring process.
 13. The electronic device of claim 11, wherein the rescoring process further comprises: receiving data specifying a general-purpose sub-grammar representing a set of alternative number-word sequences for the rescoring process; and determining, using the sub-grammar, a most likely one of the number-word sequences to be presented.
 14. The electronic device of claim 11, wherein the rescoring process further comprises: receiving phonetic specifications corresponding to the received matching data items, the specifications associated with indices; selecting one or more phonetic specifications to be presented; transferring indices corresponding to each of the selected phonetic specifications to the remote processing device; receiving, from the remote processing device, data representing a full description of data items corresponding to the transferred indices; and presenting the data representing the full description of data items.
 15. The electronic device of claim 11, wherein the rescoring process further comprises: receiving, from the remote processing device, a predetermined number of matching data items corresponding to best matches of the phonetic symbols and data items in a set of data items.
 16. The electronic device of claim 10, wherein each of the sequences or lattices is matched against a sequence of frames of spectrum parameters corresponding to the speech input data stored in the electronic device to produce a match score.
 17. The electronic device of claim 16, wherein the matching of the sequences or lattices involves at least one of a Viterbi time alignment and a full forward probability method.
 18. The electronic device of claim 1, wherein the electronic device comprises a mobile telephone.
 19. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: receive speech input; convert the speech input to a set of phonetic symbols on the electronic device; and transfer data relating to the phonetic symbols to a remote processing device over a communications network.
 20. A method, comprising: at an electronic device having one or more processors: receiving speech input; converting the speech input to a set of phonetic symbols on the electronic device; and transferring data relating to the phonetic symbols to a remote processing device over a communications network. 